Learning from Relevant Documents in Large Scale Routing Retrieval
Abstract
The normal practice of selecting relevant documents for training routing queries is to use either all relevants or the 'best n' of them after a (retrieval) ranking operation with respect to each query. Using all relevants can introduce noise and ambiguities in training because documents can be long with many irrelevant portions. Using only the 'best n' risks leaving out documents that do not resemble a query. Based on a method of segmenting documents into more uniformly sized subdocuments, a better approach is to use the top ranked subdocument of every relevant. An alternative selection strategy is based on document properties without ranking. We found experimentally that short relevant documents are the quality items for training. Beginning portions of longer relevants are also useful. Using both types provides a strategy that is effective and efficient.
1. INTRODUCTION
In ad hoc Information Retrieval (IR) one employs a user-supplied free-text query as a clue to match against a textbase and rank documents for retrieval. In a routing environment, one has the additional option to consult a user need's history to obtain a set of previously judged documents. This set may be used with an automatic learning algorithm to help refine or augment the user-supplied free-text query, or even to define the query without the user description. We focus on employing the judged relevant set in this paper. (Judged nonrelevant documents have not been found to be useful in our model.) For this option, one needs to consider two separate processes: (1) selecting the appropriate relevant documents or portions of them for training; and (2) selecting the appropriate terms from these documents, expanding the query, and then effectively weighting these terms for the query. It is well known from TREC and other experiments [1,2,3,4,5,6,7,8] that process (2) can improve routing results substantially. However, process (1) is normally not given much consideration.
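The two selection strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the PIRCS implementation: the fixed-size word windows in `segment`, the term-overlap scorer `overlap_score`, and the parameters `size` and `max_len` are all assumptions made for illustration, since the text does not specify how subdocuments are formed or ranked.

```python
from typing import List

Doc = List[str]  # a document as a list of word tokens (assumed representation)

def segment(words: Doc, size: int) -> List[Doc]:
    """Split a document into consecutive chunks of at most `size` words.
    Fixed-size windows are an assumption; the source only says 'more uniform size'."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def overlap_score(query: Doc, words: Doc) -> int:
    """Count occurrences of query terms in `words`.
    A hypothetical stand-in for a real retrieval ranking function."""
    terms = set(query)
    return sum(1 for w in words if w in terms)

def top_subdocument_per_relevant(query: Doc, relevants: List[Doc], size: int) -> List[Doc]:
    """Ranking-based strategy: keep the best-scoring subdocument of every relevant."""
    return [
        max(segment(doc, size), key=lambda s: overlap_score(query, s))
        for doc in relevants
        if doc  # skip empty documents
    ]

def short_or_beginning(relevants: List[Doc], max_len: int) -> List[Doc]:
    """Ranking-free strategy: short documents are kept whole,
    longer ones contribute only their first `max_len` words."""
    return [doc[:max_len] for doc in relevants if doc]
```

Under these assumptions, the first function needs a ranking pass over every subdocument, while the second depends only on document length, which is where the efficiency argument comes from.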
One either uses all the relevant documents, or employs the best n of them after ranking with respect to the query under consideration. However, over time in a large scale environment, hundreds or thousands of such relevant documents may accumulate for each user need. A strategy for deciding which relevant documents, and which parts of them, are to be employed for training needs to be considered. Would portions of relevant documents be sufficient? One reason for using a portion is that many documents can be long and may contain extraneous paragraphs and sections that are irrelevant. Using them for learning may contribute ambiguities during the term selection, query expansion and weighting processes. The problem is that current relevance information gathering is for whole documents only, and not at a more specific level such as which sentence or paragraph is relevant. This problem would be alleviated if users were diligent and indicated the components of a document that are actually relevant. However, this could be a burden that some users may want to avoid. It is therefore useful to have an algorithm to locate the most useful relevant components for training purposes. Another reason to use only portions of the relevants is efficiency: one would like to avoid processing long documents when most of their content is irrelevant, or to decrease the number of documents to be processed. This investigation explores ways to effectively choose a subset of documents for training a given set of routing queries.
2. PIRCS RETRIEVAL SYSTEM
PIRCS (acronym for Probabilistic Indexing and Retrieval - Components - System) is a network-based system implementing a Bayesian decision approach to
Similar resources
Information Retrieval from Large Textbases
Our objective is to enhance the effectiveness of retrieval and routing operations for large scale textbases. Retrieval concerns the processing of ad hoc queries against a static document collection, while routing concerns the processing of static, trained queries against a document stream. Both may be viewed as trying to rank relevant answer documents high in the output. Our text processing and ...
TREC 11 Experiments at NII: The Effects of Virtual Relevant Documents in Batch Filtering
Research on document retrieval, text categorization and routing has shown the effects of learning by sampling relevant or non-relevant documents from the training set. Allan et al. (1995) considered only the top K non-relevant documents, where K is the number of all known relevant documents in the training set, to learn a routing query. This is motivated by the need to have a balance ...
PIRCS: a Network-Based Document Routing and Retrieval System
Our objective is to enhance the effectiveness and efficiency of ad hoc and routing retrieval for large scale textbases. Effective retrieval means ranking relevant answer documents of a user's information need high on the output list. Our text processing and retrieval system PIRCS (Probabilistic Indexing and Retrieval - Components - System) handles English text in a domain independent fashion, and ...
Session 12: Information Retrieval
Research in information retrieval is enjoying renewed interest by many different communities. Commercial retrieval systems, which in the past have concentrated on Boolean pattern matching methodologies, are beginning to look into more sophisticated search methods, including complex statistical and/or natural language processing systems. This has spurred new interest in research in informat...
On linear mixture of expert approaches to information retrieval
Knowledge intensive organizations have a vast array of information contained in large document repositories. With the advent of E-commerce and corporate intranets/extranets, these repositories are expected to grow at a fast pace. This explosive growth has led to huge, fragmented, and unstructured document collections. Although it has become easier to collect and store information in document coll...